Fix populate antijoin to use .proj() for correct pending key computation#1405
Conversation
The antijoin that computes pending keys (`key_source - self` in `_populate_direct`, `key_source - self._target` in `jobs.refresh`, and `todo - self` in `progress`) did not project the target table to its primary key before the subtraction. When the target table has secondary (non-PK) attributes, the antijoin fails to match on primary key alone and returns all keys instead of just the unpopulated ones. This caused: - `populate(reserve_jobs=False)`: all key_source entries were iterated instead of just pending ones (mitigated by `if key in self:` check inside `_populate1`, but wasted time on large tables) - `populate(reserve_jobs=True)`: `jobs.refresh()` inserted all keys into the jobs table as 'pending', not just truly pending ones. Workers then wasted their `max_calls` budget processing already-completed entries before reaching any real work. - `progress()`: reported incorrect remaining counts in some cases Fix: add `.proj()` to the target side of all three antijoins so the subtraction matches on primary key only, consistent with how DataJoint antijoins are meant to work. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Thanks for digging into this — the One question on the motivation: I traced through the code and with the current test fixture ( On the CI failures — two issues to fix:
Happy to help get these sorted if you'd like — just let me know. |
… attrs - Fix assertion counts: Experiment.make() inserts fake_experiments_per_subject rows per key, so populate(max_calls=2) produces 10 rows, not 2 - Add test_populate_antijoin_overlapping_attrs: self-contained test with Sensor/ProcessedSensor tables that share secondary attribute names (num_samples, quality), reproducing the exact conditions where the antijoin fails without .proj() - Run ruff-format to fix lint Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…en distributed test - make() only receives PK columns -- fetch source data from Sensor() instead - Use Schema.drop(prompt=False) instead of drop(force=True) - Use decimal types instead of float to avoid portability warnings - Remove test_populate_distributed_antijoin: Experiment non-FK experiment_id degrades job granularity, making the assertion unreliable Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Thanks for the quick response. Excuse the oversight, it was the end of a long day :) I updated everything to reflect the bug. The exact conditions that trigger the bug: The antijoin key_source - self matches on ALL common column names. When key_source returns secondary attributes that share names with the target's secondary attributes, SQL matches on those too. If values Here are the table definitions from our pipeline: @Property When populate() computes pending keys:
Why we designed it this way: In this case, InitialContainer simply converts the Lightning Pose CSV output into a https://github.com/neuroinformatics-unit/movement xarray container (NetCDF). The underlying data is the same — same video, That said — if there's a recommended DataJoint pattern for this kind of multi-stage pipeline, let me know :) |
|
Thanks for the detailed explanation — the bug is clear now and it's a real issue. Caveat: This only triggers when two conditions coincide: (1) a custom User-side fix: You can also resolve this by projecting your custom @property
def key_source(self):
return (LightningPoseOutput & production_models).proj()This strips the secondary attributes before they reach the antijoin, matching the behavior of the default On semantic matching: In DataJoint 2.0, attribute lineage tracking would have caught this. The The |
|
Follow-up on semantic matching: I mentioned that DataJoint 2.0's semantic matching would have caught this, but it turns out it depends on whether lineage tracking is active in your schema. Semantic matching relies on the if not expr1.heading.lineage_available or not expr2.heading.lineage_available:
logger.warning("Semantic check disabled: ~lineage table not found. ...")
return # skips the checkSo if your pipeline's schemas don't have To enable semantic matching on existing schemas: |
Summary
_populate_direct()to useself.proj()in the antijoin that computes pending keysjobs.refresh()to useself._target.proj()when computing new keys for the jobs tableprogress()fallback path to useself.proj()in the remaining countProblem
The antijoin that computes pending keys (
key_source - self) does not project the target table to its primary key before the subtraction. When the target table has secondary (non-PK) attributes, the antijoin fails to match on primary key alone and returns all keys instead of just the unpopulated ones.This causes:
populate(reserve_jobs=False): allkey_sourceentries are iterated instead of just pending ones. Mitigated byif key in self:check inside_populate1(), but wastes time on large tables.populate(reserve_jobs=True):jobs.refresh()inserts all keys into the jobs table as'pending', not just truly pending ones. Workers then waste theirmax_callsbudget processing already-completed entries before reaching any real work — effectively making distributed populate non-functional for partially-populated tables.progress(): reports incorrect remaining counts in the fallback (no common attributes) path.Reproduction
Fix
Add
.proj()to the target side of all three antijoins so the subtraction matches on primary key only:autopopulate.py:406self._jobs_to_do(restrictions) - selfself._jobs_to_do(restrictions) - self.proj()autopopulate.py:704todo - selftodo - self.proj()jobs.py:373key_source - self._targetkey_source - self._target.proj()Test plan
test_populate_antijoin_with_secondary_attrs— verifies pending key count after partial populate (direct mode)test_populate_distributed_antijoin— verifiesjobs.refresh()only creates entries for truly pending keys (distributed mode)🤖 Generated with Claude Code